Across-thread out-of-order instruction dispatch in a multithreaded microprocessor

ABSTRACT

Instruction dispatch in a multithreaded microprocessor such as a graphics processor is not constrained by an order among the threads. Instructions for each thread are fetched into a buffer, and a dispatch circuit determines which instructions in the buffer are ready to execute. The dispatch circuit may issue any ready instruction for execution, and an instruction from one thread may be issued prior to an instruction from another thread regardless of which instruction was fetched first. If multiple functional units are available, multiple instructions can be dispatched in parallel.

CROSS-REFERENCES TO RELATED APPLICATIONS

This application is a continuation-in-part of application Ser. No. 10/742,514, filed Dec. 18, 2003, entitled “Across-Thread Out-of-Order Instruction Dispatch in a Multithreaded Processor,” which disclosure is incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

The present invention relates in general to multithreaded microprocessors, and in particular to dispatching instructions for execution in a multithreaded microprocessor without regard to order among threads.

To meet the needs of video gamers, simulation creators, and other program designers, sophisticated graphics co-processors have been developed for a variety of computer systems. These processors, which generally operate under the control of a general-purpose central processing unit or other master processor, are typically optimized to perform transformations of scene data into pixels of an image that can be displayed on a standard raster-based display device. In a common configuration, the graphics processor is provided with “geometry data,” which usually includes a set of primitives (e.g., lines, triangles, or other polygons) representing objects in a scene to be rendered, along with additional data such as textures, lighting models, and the like. The graphics processor performs modeling, viewpoint, perspective, lighting, and similar transformations on the geometry data (this stage is often referred to as “vertex” processing). After these transformations, “pixel” processing begins. During pixel processing, the geometry data is converted to raster data, which generally includes color values and other information for each sample location in an array corresponding to the viewable area; further transformations may be applied to the raster data, including texture blending and downfiltering (reducing the number of sample locations to correspond to the number of pixels in the display device). The end result is a set of color values that can be provided to the display device.

To provide smooth animations and a real-time response, graphics processors are generally required to complete these operations for a new frame of pixel data at a minimum rate of about 30 Hz. As images become more realistic (with more primitives, more detailed textures, and so on), the performance demands on graphics processors increase.

To help meet these demands, some existing graphics processors implement a multithreaded architecture that exploits parallelism. As an example, during vertex processing, the same operations are usually performed for each vertex; similarly, during pixel processing, the same operations are usually performed for each sample location or pixel location. Operations on the various vertices (or pixels) tend to be independent of operations on other vertices (pixels); thus, each vertex (pixel) can be processed as a separate thread executing a common program. The common program provides a sequence of instructions to execution units in an execution core of the graphics processor; at a given time, different threads may be at different points in the program sequence. Since the execution time (referred to herein as latency) of an instruction may be longer than one clock cycle, the execution units are generally implemented in a pipelined fashion so that a second instruction can be issued before all preceding instructions have finished, as long as the second instruction does not require data resulting from the execution of an instruction that has not finished.

In such processors, the execution core is generally designed to fetch instructions to be executed for the different active threads in a round-robin fashion (i.e., one instruction from the first thread, then one from the second, and so on) and present each fetched instruction sequentially to an issue control circuit. The issue control circuit holds the fetched instruction until its source data is available and the execution units are ready, then issues it to the execution units. Since the threads are independent, round-robin issue reduces the likelihood that an instruction will depend on a result of a still-executing instruction. Thus, latency of an instruction in one thread can be hidden by fetching and issuing an instruction from another thread. For instance, a typical instruction might have a latency of 20 clock cycles, which could be hidden if the core supports 20 threads.

However, round-robin issue does not always hide the latency. For example, pixel processing programs often include instructions to fetch texture data from system memory. Such an instruction may have a very long latency (e.g., over 100 clock cycles). After a texture fetch instruction is issued for a first thread, the issue control circuit may continue to issue instructions (including subsequent instructions from the first thread that do not depend on the texture fetch instruction) until it comes to an instruction from the first thread that requires the texture data. This instruction cannot be issued until the texture fetch instruction completes. Accordingly, the issue control circuit stops issuing instructions and waits for the texture fetch instruction to be completed before beginning to issue instructions again. Thus, “bubbles” can arise in the execution pipeline, leading to idle time for the execution units and inefficiency in the processor.

One way to reduce this inefficiency is by increasing the number of threads that can be executed concurrently by the core. This, however, is an expensive solution because each thread requires additional circuitry. For example, to accommodate the frequent thread switching that occurs in this parallel design, each thread is generally provided with its own dedicated set of data registers. Increasing the number of threads increases the number of registers required, which can add significantly to the cost of the processor chip, the complexity of the design, and the overall chip area. Other circuitry for supporting multiple threads, e.g., program counter control logic that maintains a program counter for each thread, also becomes more complex and consumes more area as the number of threads increases.

It would therefore be desirable to provide an execution core architecture that efficiently and effectively reduces the occurrence of bubbles in the execution pipeline without requiring substantial increases in chip area.

BRIEF SUMMARY OF THE INVENTION

Embodiments of the present invention provide systems and methods for dispatching instructions in a multithreaded microprocessor (such as a graphics processor) in a manner that is not constrained by an order among the threads. Instructions for the various threads are fetched, e.g., into an instruction buffer that is configured to store at least one instruction from each of the threads. A dispatch circuit determines which of the fetched instructions are ready to execute and may issue any instruction that is ready. Thus, an instruction from any one thread may be issued prior to an instruction from another thread, regardless of which instruction was fetched first.

According to an aspect of the present invention, a method for executing multiple threads in a multithreaded processor includes defining multiple threads, each of which executes a sequence of program instructions. A first instruction for a first one of the threads and a second instruction for a second one of the plurality of threads are each fetched. The first instruction has a latency period associated therewith, and during the latency period associated with the first instruction, the second instruction is issued. The first instruction and the second instruction are issued in an order independent of an order in which the first and second instructions were fetched.

In some embodiments, the first thread and the second thread execute different programs. In other embodiments, the first thread and the second thread execute the same program on different input data, and the first instruction and the second instruction might be instructions from different portions of the same program.

In some embodiments, the first instruction is issued to a first functional unit in the multithreaded processor and the second instruction is issued to a second functional unit in the multithreaded processor.

In some embodiments, during the latency period associated with the first instruction, a third instruction is issued. The first instruction, the second instruction, and the third instruction are issued in an order independent of an order of instruction fetch. The third instruction might be an instruction for a third one of the threads, an instruction for the second thread, or an instruction for the first thread; it should be noted that two instructions from the same thread can be issued in consecutive processing cycles. In some embodiments, instructions within a thread are always issued in order. Further, in some cases the second instruction and the third instruction can be issued in parallel.

According to another aspect of the present invention, a method for executing multiple threads in a multithreaded processor includes defining multiple threads, each of which executes a sequence of program instructions. Instructions are fetched, including a first instruction for a first one of the threads, a second instruction for a second one of the threads, and a third instruction for a third one of the threads. In one example, the first instruction is fetched subsequently to the third instruction. The first instruction is issued to a first functional unit in the multithreaded processor prior to issuing the third instruction, and in parallel with issuing the first instruction, the second instruction is issued to a second functional unit in the multithreaded processor. The third instruction can be issued at a later time, e.g., during a latency period associated with one of the first instruction or the second instruction.

According to a further aspect of the present invention, a microprocessor is configured for parallel processing of multiple threads, each of which executes a sequence of program instructions. The microprocessor includes an execution module, a fetch circuit, and an issue circuit. The execution module is adapted to execute instructions for all of the threads. The fetch circuit is adapted to fetch instructions from a sequence of program instructions for each of the threads. The issue circuit is adapted to issue the instructions fetched by the fetch circuit to the execution module. The instructions for different ones of the plurality of threads are issued in an order independent of an order in which the instructions for the different ones of the plurality of threads were fetched. The issue circuit is advantageously adapted such that, during a latency period associated with a first issued instruction for a first one of the threads, the issue circuit can issue at least one instruction for a second one of the threads. In some embodiments, the fetch circuit is adapted to fetch a subsequent instruction for a first thread in response to the issue circuit issuing a previously fetched instruction for the first thread.

In some embodiments, the execution module includes a plurality of functional units and the issue circuit is further adapted to issue at least two instructions in parallel, each of the instructions issued in parallel being directed to a different one of the functional units. The instructions issued in parallel might (or might not) be for different threads. The maximum number of instructions issuable in parallel can be less than the number of functional units in the execution module.

The following detailed description together with the accompanying drawings will provide a better understanding of the nature and advantages of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a simplified high-level block diagram of a computer system according to an embodiment of the present invention;

FIG. 2 is a simplified block diagram of an instruction fetch circuit and instruction buffer according to an embodiment of the present invention;

FIG. 3 is a simplified block diagram of a selection logic circuit for selecting an instruction to fetch according to an embodiment of the present invention;

FIG. 4 is a simplified block diagram of an instruction fetch circuit according to an alternative embodiment of the present invention;

FIG. 5 is a simplified block diagram of an instruction dispatch circuit according to an embodiment of the present invention;

FIG. 6 is a simplified block diagram of selection logic for selecting an instruction to issue according to an embodiment of the present invention;

FIG. 7 is a simplified block diagram of an instruction issuer and execution module according to an embodiment of the present invention that supports superscalar instruction issue; and

FIG. 8 is a snapshot view of instructions for different threads that might be ready at the same time, illustrating a principle of diversity of work according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

Embodiments of the present invention provide systems and methods for dispatching instructions in a multithreaded microprocessor (such as a graphics processor) in a manner that is not constrained by an order among the threads. Instructions for the various threads are fetched, e.g., into an instruction buffer that is configured to store at least one instruction from each of the threads. A dispatch circuit determines which of the fetched instructions are ready to execute and may issue any instruction that is ready. Thus, an instruction from any one thread may be issued prior to an instruction from another thread, regardless of which instruction was fetched first.

FIG. 1 is a block diagram of a computer system 100 according to an embodiment of the present invention. Computer system 100 includes a central processing unit (CPU) 102 and a system memory 104 communicating via a bus 106. User input is received from one or more user input devices 108 (e.g., keyboard, mouse) coupled to bus 106. Visual output is provided on a pixel based display device 110 (e.g., a conventional CRT or LCD based monitor) operating under control of a graphics processing subsystem 112 coupled to system bus 106. A system disk 128 and other components, such as one or more removable storage devices 129 (e.g., floppy disk drive, compact disk (CD) drive, and/or DVD drive), may also be coupled to system bus 106. System bus 106 may be implemented using one or more of various bus protocols including PCI (Peripheral Component Interconnect), AGP (Accelerated Graphics Port) and/or PCI-Express (PCI-E); appropriate “bridge” chips such as a conventional north bridge and south bridge (not shown) may be provided to interconnect various components and/or buses.

Graphics processing subsystem 112 includes a graphics processing unit (GPU) 114, a graphics memory 116, and scanout control logic 120, which may be implemented, e.g., using one or more integrated circuit devices such as programmable processors and/or application specific integrated circuits (ASICs). GPU 114 may be configured to perform various tasks, including generating pixel data from graphics data supplied via system bus 106, interacting with graphics memory 116 to store and update pixel data, and the like. Relevant features of GPU 114 are described further below.

Scanout control logic 120 reads pixel data from graphics memory 116 (or, in some embodiments, system memory 104) and transfers the data to display device 110 to be displayed. In one embodiment, scanout occurs at a constant refresh rate (e.g., 80 Hz); the refresh rate can be a user selectable parameter. Scanout control logic 120 may also perform other operations such as adjusting color values for particular display hardware; generating composite screen images by combining the pixel data with data for a video or cursor overlay image or the like obtained, e.g., from graphics memory 116, system memory 104, or another data source (not shown); converting digital pixel data to analog signals for the display device; and so on. It will be appreciated that the particular configuration of graphics processing subsystem 112 is not critical to the present invention.

During operation of system 100, CPU 102 executes various programs, such as operating system (OS) programs and application programs, as well as a driver program for graphics processing subsystem 112. These programs may be of generally conventional design. For instance, the graphics driver program may implement one or more standard application program interfaces (APIs), such as OpenGL, Microsoft DirectX, or D3D for communication with graphics processing subsystem 112; any number or combination of APIs may be supported, and in some embodiments separate driver programs may be provided to implement different APIs. By invoking appropriate API function calls, operating system programs and/or application programs instruct the graphics driver program to transfer graphics data or pixel data to graphics processing subsystem 112 via system bus 106, to invoke various rendering functions of GPU 114, and so on. The specific commands and/or data transmitted to graphics processing subsystem 112 by the graphics driver program in response to an API function call may vary depending on the implementation of GPU 114, and the graphics driver program may also transmit commands and/or data implementing additional functionality (e.g., special visual effects) not controlled by operating system or application programs.

In accordance with an embodiment of the present invention, GPU 114 is configured for concurrent processing of a large number of threads, where each thread corresponds to an independent sequence of processing instructions. GPU 114 can issue a next instruction from any of the threads at any given time. In some embodiments, GPU 114 issues one instruction per cycle; in other embodiments, multiple instructions from different threads (or in some instances from the same thread) can be issued in parallel.

For example, multiple threads might be used to process vertices of an image, with concurrent threads executing the same processing program(s) on different data for the image; at a given time, different ones of the threads may be at different points in the program. In some embodiments, there may be multiple thread types, where all threads of one type perform the same processing program and threads of different types perform different processing programs. For example, there may be a “vertex” thread type whose processing program includes geometry and lighting transformations and a “pixel” thread type whose processing program includes texture blending and downfiltering of oversampled data.

In the embodiment of FIG. 1, GPU 114 includes a number of independent execution cores 118, each of which is configured to process instructions received from a number of threads (not shown). The maximum number of concurrent threads supported by GPU 114 is the number of cores 118 multiplied by the number of threads per core; for instance, in one embodiment, there are eight cores 118, each of which can support up to 16 threads, for a total of 128 concurrently executing threads. The number of cores and number of threads may be varied; for example, there may be eight cores, each supporting 32 threads (256 total threads); ten cores, each supporting 24 threads (240 total threads); 16 cores, each supporting 24 threads (384 total threads), and so on.
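The thread-capacity arithmetic above can be checked with a few lines of Python. This is only an illustrative calculation; the function name max_concurrent_threads is hypothetical and not part of the described hardware.

```python
def max_concurrent_threads(num_cores: int, threads_per_core: int) -> int:
    """Maximum concurrent threads = number of cores x threads per core."""
    return num_cores * threads_per_core


# (cores, threads per core) pairs taken from the example configurations above.
configs = [(8, 16), (8, 32), (10, 24), (16, 24)]
totals = [max_concurrent_threads(cores, threads) for cores, threads in configs]
print(totals)
```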

Each execution core 118 includes an instruction cache 132, an instruction fetch circuit 136, a buffer 138, a dispatch circuit 140, an execution module 142 that includes a set of execution units (not shown), and a register file 144. Instruction cache (Icache) 132, which may be of generally conventional design, stores executable instructions that are obtained, e.g., from graphics memory 116. Each instruction in Icache 132 may be identified using a program counter (PC) value; in one embodiment, the PC value includes a base PC value associated with the program to which the instruction belongs and an offset PC value specific to the instruction. In some embodiments, Icache 132 may be logically divided, with instructions from different programs occupying different logical subdivisions within Icache 132. Fetch circuit 136 fetches instructions from Icache 132 for all threads processed by execution core 118, maintaining a sequential program order within each thread, and supplies the fetched instructions to buffer 138. On each clock cycle, dispatch circuit 140 selects an instruction from buffer 138 to be issued to execution module 142.

In one embodiment, buffer 138 is configured to store at least one instruction per thread and to maintain the sequential program order for each thread. On each clock cycle, dispatch circuit 140 selects one of the instructions from buffer 138 for execution, obtains the source operands from register file 144, and forwards the instruction and operands to execution module 142 for execution. Dispatch circuit 140 advantageously selects a next instruction to execute based on which instructions in buffer 138 have their source operands available in register file 144 and may select instructions without regard for which thread is the source of the selected instruction. Fetch circuit 136 monitors buffer 138 and fetches new instructions to replace instructions that have issued from buffer 138. In one embodiment, after an instruction for a particular thread has issued from buffer 138, fetch circuit 136 fetches a subsequent instruction for that thread. As a result, for a given clock cycle, instructions from most or all of the active threads may be available in buffer 138, and dispatch circuit 140 may select an instruction from any thread, regardless of which thread was last selected. Specific embodiments of fetch circuit 136, buffer 138 and dispatch circuit 140 are described below.

Execution module 142 may be of generally conventional design and may include any number of individual execution units. Some or all of the execution units may be configured for single-instruction multiple-data (SIMD) operation as is known in the art. Execution module 142 receives an instruction and its source operands from dispatch circuit 140, processes the source operands in accordance with the instruction, and stores result data in register file 144. Register file 144 advantageously includes a separate set of registers for each thread processed by execution core 118, thereby avoiding the need to swap data in and out of registers when switching from one thread to another. Data written to register file 144 becomes available as source operands for subsequent instructions. The instructions may vary in character and may include any number of source operands and any amount and/or kind of result data.

Each instruction generally has a certain latency associated with it; that is, the execution units of execution module 142 require a certain number of clock cycles (which may be one or more) to process the instruction and write the result data to register file 144. Different instructions may have different latencies. For example, a simple vector add operation may be completed in only one or two clock cycles, while a texture fetch operation may require a large number (e.g., 100 or more) of cycles. Execution units of execution module 142 are advantageously implemented in a pipelined architecture so that an instruction can be dispatched on each clock cycle notwithstanding the latency; such architectures are known in the art. Different ones (or groups) of the execution units may be specially adapted to process particular instructions, as is known in the art, and dispatch circuit 140 may select an appropriate one (or group) of execution units within execution module 142 to process a particular instruction.

The instructions of a thread may have data dependencies on other instructions of that thread; that is, one instruction may use result data of a previous instruction as its source operand. An instruction with a data dependency cannot execute until the result data from the instruction on which it depends is available in register file 144. If an instruction with such a data dependency is next for a particular thread, that thread is blocked. In accordance with an embodiment of the present invention, dispatch circuit 140 detects a blocked thread and selects the next instruction of a different thread (which may be any thread that is not blocked) from buffer 138 to be issued next, rather than waiting for the blocked thread to become unblocked. In this manner, latency within one thread can be hidden by executing another thread, so that the efficiency of GPU 114 is improved.
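The effect of skipping a blocked thread can be illustrated with a small cycle-level model. The following Python sketch is a simplified software analogy, not the circuit of any embodiment: it assumes one issue slot per cycle, per-thread registers tracked in a dictionary, in-order issue within each thread, and a lowest-thread-number tie-break. The names Instr, can_issue, and simulate are hypothetical. It compares a strict round-robin issuer, which waits whenever the thread whose turn it is is blocked, against an issuer that may pick any unblocked thread.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str      # register written by the instruction
    srcs: tuple    # registers read by the instruction
    latency: int   # cycles after issue until dest becomes available


def can_issue(program, pc, ready_at, cycle):
    """True if the thread's next instruction has all source operands available."""
    if pc >= len(program):
        return False
    return all(ready_at.get(r, 0) <= cycle for r in program[pc].srcs)


def simulate(programs, across_thread):
    """Issue at most one instruction per cycle; return (cycles, idle_cycles)."""
    n = len(programs)
    pc = [0] * n                        # next instruction index per thread
    ready_at = [{} for _ in range(n)]   # per-thread: register -> cycle it is ready
    cycle = idle = turn = 0
    while any(pc[t] < len(programs[t]) for t in range(n)):
        if across_thread:
            # Any thread that is not blocked is a candidate.
            ready = [t for t in range(n)
                     if can_issue(programs[t], pc[t], ready_at[t], cycle)]
        else:
            # Strict round-robin: skip finished threads, but wait on a blocked one.
            while pc[turn] >= len(programs[turn]):
                turn = (turn + 1) % n
            ok = can_issue(programs[turn], pc[turn], ready_at[turn], cycle)
            ready = [turn] if ok else []
        if ready:
            t = ready[0]
            ins = programs[t][pc[t]]
            ready_at[t][ins.dest] = cycle + ins.latency
            pc[t] += 1
            turn = (t + 1) % n
        else:
            idle += 1                   # a "bubble": nothing issued this cycle
        cycle += 1
    return cycle, idle


# Thread 0: a long-latency load followed by a dependent instruction.
# Thread 1: a chain of short dependent instructions.
programs = [
    [Instr("r0", (), 8), Instr("r1", ("r0",), 1)],
    [Instr("r0", (), 1), Instr("r1", ("r0",), 1),
     Instr("r2", ("r1",), 1), Instr("r3", ("r2",), 1)],
]
round_robin = simulate(programs, across_thread=False)
out_of_order = simulate(programs, across_thread=True)
print(round_robin, out_of_order)
```

In this toy workload the across-thread issuer finishes in fewer cycles with fewer idle slots, because thread 1 keeps issuing while thread 0 waits on its long-latency result. With only two threads some idle cycles remain in both cases.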

In addition to execution core 118, GPU 114 may also include other features not shown in FIG. 1, such as circuitry for receiving and responding to commands received via system bus 106; such circuitry may be configured to initiate and/or terminate threads in execution core 118 as appropriate. Various control registers, status registers, data caches and the like may be provided on a global, per-core, or per-thread basis. Such features are known in the art, and a detailed description is omitted as not being crucial to understanding the present invention.

It will be appreciated that the system described herein is illustrative and that variations and modifications are possible. A graphics processor may be implemented using any suitable technologies, e.g., as one or more integrated circuit devices. A graphics processor may be mounted on an expansion card (which may include one or more such processors) or integrated into a system chipset (e.g., into the north bridge chip). The graphics processing subsystem may include any amount of dedicated graphics memory (some implementations may have no dedicated graphics memory) and may use system memory and dedicated graphics memory in any combination. Further, in some embodiments, the graphics processor may be configurable to perform general-purpose computations, e.g., by providing an appropriate driver program.

The number of execution cores in the graphics processor is implementation dependent, and optimal choices generally depend on tradeoffs between performance and cost. Each execution core may support concurrent operation of one or more thread types; where multiple cores are provided, different cores in the same processor may be configured identically or differently. The cores are advantageously implemented as independent sub-processors that do not share execution units, and a given thread is executed in one core.

The number of threads in a given core may also be varied according to the particular implementation and the amount of latency that is to be hidden. In this connection, it should be noted that in some embodiments, instruction ordering can also be used to hide some latency. For instance, as is known in the art, compilers for graphics processor code can be optimized to arrange the instructions of the program such that if there is a first instruction that creates data and a second instruction that consumes the data, one or more other instructions that do not consume the data created by the first instruction are placed between the first and second instructions. This allows processing of a thread to continue while the first instruction is executing. It is also known in the art that, for instructions with long latencies, it is usually not practical to place enough independent instructions between creator and consumer to fully hide the latency. In determining the number of threads per core, consideration may be given to the availability (or lack thereof) of such optimizations; e.g., the number of threads supported by a core may be decided based on the maximum latency of any instruction and the average (or minimum or maximum) number of instructions that a particular compiler can be expected to provide between a maximum-latency instruction and its first dependent instruction.
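The text does not give a formula for this sizing decision. As one rough way to reason about it, the Python sketch below assumes a single issue slot per cycle and assumes that each thread can fill one slot with the long-latency instruction plus one slot for each independent instruction the compiler places after it. Under those assumptions, the thread count needed to cover the latency is the latency divided by the slots each thread contributes, rounded up. The function name threads_to_hide and the model itself are illustrative assumptions, not a rule taken from the description.

```python
import math


def threads_to_hide(max_latency: int, independent_instrs: int) -> int:
    """Rough estimate of threads needed so issue slots stay filled.

    Assumes one issue per cycle and that each thread contributes
    (independent_instrs + 1) issue slots before it blocks.
    """
    slots_per_thread = independent_instrs + 1
    return math.ceil(max_latency / slots_per_thread)


# 20-cycle latency, no compiler help: about 20 threads.
no_help = threads_to_hide(20, 0)
# 100-cycle texture fetch with 4 independent instructions scheduled after it.
with_help = threads_to_hide(100, 4)
print(no_help, with_help)
```

The estimate ignores execution-unit conflicts and differing latencies among instructions, so it should be read as a back-of-the-envelope bound rather than a design rule.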

The instruction cache for an execution core may be shared among the threads or may be physically or logically divided among them. In addition, where the core supports multiple thread types, the instruction cache may include a physical and/or logical division corresponding to each thread type, and each division may be further subdivided (or not) among individual threads of that type as desired.

The register file for an execution core advantageously includes a set of registers for each thread and may have any number of read and/or write ports. In addition, physically and/or logically separate register files may be provided for different threads.

While the configuration of fetch circuit 136, buffer 138, and dispatch circuit 140 may also be varied, specific examples will now be described. FIG. 2 is a simplified block diagram of fetch circuit 136 and buffer 138 for an execution core 118 according to an embodiment of the present invention. In this embodiment, execution core 118 is configured to process up to a maximum number (N) of threads concurrently, although it is to be understood that at any given time some or all of the N threads may be idle or inactive.

Fetch circuit 136 includes a number (N) of program counter logic blocks 202 and an arbitration unit 204 controlled by selection logic circuit 206. (Herein, multiple instances of like objects are denoted with reference numbers identifying the object and parenthetical numbers identifying the instance where needed.)

Each program counter logic block 202 generates a program counter (PC) value for a next sequential instruction in a respective one of the N threads. Program counter logic blocks 202 may be of generally conventional design for updating a program counter and may include incremental counters, branch detection logic, and other features not critical to the present invention.

The PC values generated by PC logic blocks 202 are presented to arbitration unit 204, which selects the PC signal PCi (where 0≤i≤N−1) from one of the threads (denoted for reference herein as thread i) in response to a selection signal SELi provided by selection logic circuit 206 (described below). The selected signal PCi is transmitted to Icache 132, which returns the corresponding instruction to buffer 138, and the identifier (i) of the corresponding thread is transmitted to buffer 138.

Buffer 138 includes N storage locations 208 (which may be implemented, e.g., using registers), one of which corresponds to each of the N threads, and an array 210 configured to store N valid bits (one for each register). Buffer 138 receives the instruction (INST) from Icache 132 and the thread identifier (i) of the corresponding thread from arbitration unit 204 and directs the instruction INST to the one of locations 208 that corresponds to thread i. When the instruction is stored, the corresponding valid bit in array 210 is set to logical true (e.g., “1”).

Buffer 138 is advantageously configured such that dispatch circuit 140 may select an instruction from any one of storage locations 208 to be issued, so that instructions from different threads may be issued in any order. Dispatch circuit 140 is described below; for now it should be noted that when the instruction for a particular thread is issued, the corresponding valid bit in array 210 is advantageously set to logical false (e.g., “0”). As used herein, a “valid thread” is one that has a valid instruction in storage locations 208 and an “invalid thread” is one that does not.
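The storage locations and valid bits described above can be modeled in software as follows. This Python sketch is an analogy for illustration only; the class name InstructionBuffer and its method names are hypothetical, and real hardware would use registers and a bit array rather than Python lists.

```python
class InstructionBuffer:
    """One storage location and one valid bit per thread."""

    def __init__(self, num_threads: int):
        self.slots = [None] * num_threads    # analogous to storage locations 208
        self.valid = [False] * num_threads   # analogous to valid-bit array 210

    def store(self, thread_id: int, inst) -> None:
        """Place a fetched instruction in the thread's slot and mark it valid."""
        self.slots[thread_id] = inst
        self.valid[thread_id] = True

    def issue(self, thread_id: int):
        """Return the thread's instruction and clear its valid bit."""
        if not self.valid[thread_id]:
            raise ValueError(f"thread {thread_id} has no valid instruction")
        self.valid[thread_id] = False
        return self.slots[thread_id]

    def invalid_threads(self):
        """Threads with no valid instruction, i.e. candidates for the next fetch."""
        return [i for i, v in enumerate(self.valid) if not v]


buf = InstructionBuffer(4)
buf.store(2, "ADD")
buf.store(0, "TEX")
issued = buf.issue(2)           # threads may be issued in any order
print(issued, buf.invalid_threads())
```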

As shown in FIG. 2, selection logic circuit 206 receives the valid bits of array 210 from buffer 138. Selection logic circuit 206 uses validity or invalidity of each thread in selecting the thread i for which an instruction is to be fetched. For example, selection logic circuit 206 may be configured to select only invalid threads; where multiple threads are invalid, selection logic circuit 206 may select the thread that has been invalid longest or may select a thread based on a priority ranking among the threads, where the priority ranking varies from one clock cycle to the next.

Selection logic circuit 206 may also include a rule limiting the frequency with which a particular thread can be selected, e.g., in order to prevent one thread from disproportionately consuming resources. For example, one rule might provide that a given thread is ineligible for reselection until at least M clock cycles have elapsed since it was last selected, where M is some fixed number (which may be established, e.g., as a configurable parameter of the processor). Where such a rule is implemented, there may be clock cycles in which no threads satisfy the selection rules (e.g., the only invalid thread was selected fewer than M cycles ago). In this event, arbitration unit 204 may send no PCi value to Icache 132 for that clock cycle; the next PCi value is sent during a subsequent cycle when a satisfactory thread is found. In one such embodiment, where one thread is selected per clock cycle, M is set to a value that is not larger than the minimum number of threads expected to be active at a given time, thereby reducing the likelihood of a clock cycle in which no thread is selected. In some embodiments, a particular hardware implementation may inherently limit the frequency with which a thread can be selected without requiring additional control logic and/or configuration software.
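By way of illustration only, the reselection rule described above can be modeled in software. The following Python sketch is not part of any described circuit; the function name, the `last_selected` bookkeeping, and the tie-break (preferring the thread selected least recently) are assumptions made for the example.

```python
def select_with_cooldown(valid, last_selected, cycle, m):
    """Pick an invalid thread that was not selected in the last m cycles.

    valid[i]         -- True if thread i has a valid instruction in the buffer
    last_selected[i] -- cycle at which thread i was last selected, or None
    Returns a thread index, or None if no thread satisfies the rule
    (in which case no PC value would be sent for this cycle).
    """
    candidates = [
        i for i, is_valid in enumerate(valid)
        if not is_valid
        and (last_selected[i] is None or cycle - last_selected[i] >= m)
    ]
    if not candidates:
        return None
    # Assumed tie-break: prefer the thread selected least recently.
    return min(
        candidates,
        key=lambda i: -1 if last_selected[i] is None else last_selected[i],
    )
```

With M=4, a lone invalid thread last selected at cycle 3 is skipped at cycle 5 and becomes eligible again at cycle 7.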

FIG. 3 is a simplified block diagram of a selection logic circuit 300 implementing thread selection rules according to an embodiment of the present invention. Selection logic circuit 300 includes a priority encoder 302 and a phase (or token) counter 304. The valid signal for each thread is inverted by a respective inverter 306, and the resulting /valid signals are provided to priority encoder 302. Priority encoder 302, which may be implemented using conventional digital logic circuitry, selects the highest-priority thread for which the /valid signal is asserted (i.e., the highest-priority invalid thread), where the priority ranking among the threads is determined based on a control signal (CTL) provided by phase counter 304. Phase counter 304 is a modulo N counter that increments on every clock cycle; the control signal CTL corresponds to the current value of phase counter 304. In this embodiment, control signal CTL determines the thread number of the highest-priority thread, and priority encoder 302 ranks the remaining threads in order of ascending (or descending) thread numbers, modulo N.

Because phase counter 304 increments at each clock cycle, the priority ranking of the threads is different for different clock cycles. For example, during a first clock cycle, phase counter 304 has value 0, and priority encoder 302 gives highest priority to thread 0. In other words, during the first clock cycle, if thread 0 is invalid, priority encoder 302 generates a state of the SELi signal that selects thread 0. If thread 0 is valid, thread 1 is considered next, and so on until an invalid thread is found or a maximum number of threads (which may be less than or equal to N) has been considered. During the next clock cycle, phase counter 304 has value 1, and priority encoder 302 gives highest priority to thread 1, then to thread 2 if thread 1 is valid, and so on.

Once a thread becomes invalid, it remains invalid until its next instruction is fetched. Thus, while selection logic circuit 300 does not guarantee that, on any given clock cycle, the thread that has been invalid longest is selected, it will be appreciated that any thread that becomes invalid will be selected within N clock cycles of becoming invalid. In some embodiments, the maximum number C of threads that priority encoder 302 considers during a clock cycle may be limited to a number smaller than the total number N of threads. As long as the priority rotates regularly, each thread will be among the C threads considered for an approximately equal fraction of cycles and will eventually be selected. (It should be noted that in embodiments where C is less than N, if the issued instruction on a given cycle is for the highest-priority thread, then that thread would not be considered in the next cycle.)
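The rotating-priority selection of FIG. 3 can be illustrated with a short behavioral model. The Python sketch below is an assumption-laden software analogue, not a description of the hardware: `phase` stands in for control signal CTL, `max_considered` stands in for the limit C, and ascending thread order is assumed.

```python
def select_invalid_thread(valid, phase, max_considered=None):
    """Return the highest-priority invalid thread, or None.

    valid[i] -- True if thread i already has a valid buffered instruction
    phase    -- current value of the modulo-N phase counter (thread with
                highest priority this cycle)
    max_considered -- optional limit C on how many threads are examined
    """
    n = len(valid)
    limit = n if max_considered is None else min(max_considered, n)
    for offset in range(limit):
        thread = (phase + offset) % n  # ascending order, modulo N
        if not valid[thread]:          # corresponds to /valid asserted
            return thread
    return None
```

For example, with valid bits `[True, False, True, False]`, a phase of 0 selects thread 1 and a phase of 2 selects thread 3; with `max_considered=1` and a phase of 0, no thread is selected on that cycle.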

It will be appreciated that the selection logic circuit and selection rules described herein are illustrative and that variations and modifications are possible. The various circuit components described herein may be implemented using conventional digital logic circuit designs and technologies. Different logic circuits may also be implemented to support different selection rules. For example, in embodiments where more than one instruction may be fetched per clock cycle, the priority encoder may be configured to select multiple threads per clock cycle. Moreover, devices other than priority encoders may be used for determining which invalid thread to select. For instance, the selection logic circuit may maintain a “least recently valid” bit field that is updated when a transition of one of the valid bits between the logical true and logical false states is detected. In still other embodiments, counters or similar circuits may be used to determine elapsed time since a thread became invalid and/or elapsed time since a thread was last selected; comparison logic that operates on the counter values may be provided to identify a least recently valid thread.

In addition, the selection logic may include additional circuitry that inhibits selection of a thread between a selection time and a time when the corresponding instruction appears in buffer 138. For example, in the event of an Icache miss, it may take several cycles to retrieve the instruction from the main instruction store (or a secondary cache) and provide it to buffer 138. In some embodiments, it may be desirable to inhibit reselection of that thread during this interval, e.g., to prevent instructions within a thread from being provided to buffer 138 and/or issued out of their program order. It should be noted that because fetch circuit 136 does not select threads in a round robin fashion, instructions from other threads may continue to be fetched to buffer 138 and issued while fetching of instructions for a thread that encountered an Icache miss is inhibited. Thus, some embodiments described herein can avoid pipeline bubbles and inefficiency in the event of an Icache miss.

Where multiple thread types are supported, the selection logic may take thread type into account or not, as desired. For example, in the embodiment shown in FIG. 2, information about thread types is not provided to selection logic circuit 206. FIG. 4 is a block diagram of a fetch circuit 400 according to an alternative embodiment of the present invention that takes thread type into account. In this embodiment, the execution core also supports N threads, which may include up to K threads of a first type (“A”) and up to N−K threads of a second type (“B”).

A type A arbitration unit 402 receives program counter signals from the active type A threads (numbered for reference purposes as 0 to K−1), and a type B arbitration unit 404 receives program counter signals from the active type B threads (numbered for reference purposes as K to N−1). Type A arbitration unit 402 selects one of the type A threads in response to a selection signal from selection logic circuit 406, and type B arbitration unit 404 selects one of the type B threads in response to a selection signal from selection logic circuit 408. In one embodiment, the configuration of each of selection logic circuits 406, 408 is generally similar to that described above with reference to FIG. 3 so that each selection logic circuit 406, 408 selects the thread of its respective type that has been invalid the longest; it will be appreciated that other configurations and selection rules may also be used. As described above, depending on the selection rules, there may be clock cycles for which one (or both) of arbitration units 402, 404 does not select any thread.

In response to the selection signals from selection logic circuits 406, 408, type A arbitration unit 402 and type B arbitration unit 404 provide respective selected program counter values (PCa, PCb) to a global arbitration unit 410. Arbitration units 402, 404 also advantageously identify the respective threads (a, b) that were selected. Global arbitration unit 410 selects between PCa and PCb in response to a type selection signal (A/B) generated by a thread-type priority circuit 412.

Thread-type priority circuit 412 may be configured in various ways to define a desired relative priority between thread types A and B. In one embodiment, thread-type priority circuit 412 may be configured to give equal priority to both, e.g., by selecting PCa and PCb on alternating clock cycles. In another embodiment, thread-type priority circuit 412 may select the least recently valid of the two candidate threads.

In yet another embodiment, thread-type priority circuit 412 gives priority to one or the other thread type based on static or dynamic “importance” criteria. Various criteria may be used. For example, if the thread types correspond to pixel threads and vertex threads, it may be desirable to give priority to vertex threads (e.g., because some pixel threads might not be able to be initiated until processing of a relevant vertex thread has been completed). Thus, one selection rule might always choose a vertex thread over a pixel thread. Another selection rule might be defined as a repeating sequence of some number of vertices followed by some number of pixels (e.g., two vertices then one pixel, or three vertices then two pixels, or, more generally, v vertices followed by p pixels for arbitrary integers v and p). Importance can also be defined dynamically, e.g., depending on the number of vertex and/or pixel threads that are currently active or that are currently awaiting processing. Selection rules for thread-type priority circuit 412 may be made configurable to support optimization for a particular system implementation.

Global arbitration unit 410 selects between PCa and PCb based on type selection signal A/B and provides the selected program counter value (labeled PCi) to Icache 132 substantially as described above. In some embodiments, the type selection signal A/B may occasionally specify thread type A (or B) during a clock cycle in which no thread of type A (B) was selected by the type-specific arbiter 402 (404). Global arbitration unit 410 may be configured to select PCb (PCa) in this event or to select no thread (i.e., no PCi is sent to Icache 132).
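The repeating type sequence and the global arbitration step described above may be sketched in software as follows. This Python fragment is illustrative only; the function names, the use of `None` to mean "no thread of that type was selected," and the `allow_fallback` flag are assumptions introduced for the example, and type "A" is assumed here to be the type listed first in the repeating sequence (e.g., vertex threads).

```python
from itertools import cycle, islice


def type_schedule(v, p, count):
    """First `count` type selections for a repeating sequence of
    v selections of type A followed by p selections of type B."""
    pattern = ["A"] * v + ["B"] * p
    return list(islice(cycle(pattern), count))


def global_arbitrate(type_sel, pc_a, pc_b, allow_fallback=True):
    """Choose between the type-specific program counters.

    pc_a / pc_b are None when the type-specific arbiter selected no thread.
    With allow_fallback, the other type's PC is used when the preferred
    type has none; otherwise no PC is sent to the cache (None).
    """
    preferred, other = (pc_a, pc_b) if type_sel == "A" else (pc_b, pc_a)
    if preferred is not None:
        return preferred
    return other if allow_fallback else None
```

For instance, `type_schedule(2, 1, 6)` yields two A selections followed by one B selection, repeated twice.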

It will be appreciated that the fetch circuit and buffer described herein are illustrative and that variations and modifications are possible. Where different threads (or different thread types) have physically or logically separate instruction caches, the fetch circuit may be configured to direct the selected PC value to the appropriate cache, or to provide a thread (or thread type) identifier that can be used to select the appropriate cache. The buffer may provide storage for more than one instruction per thread, e.g., by providing a FIFO register for each thread, and the fetch circuit may select a next thread to fetch based on the number of invalid or unused entries in each of the FIFOs.

In some embodiments, it is not necessary for the fetch circuit to prefill the buffer to any particular level prior to instruction issue. Instead, the buffer may tend to fill naturally as instruction issue occasionally skips clock cycles due to data dependencies and the like. The thread selection logic of the fetch circuit is advantageously configured to select threads only when space exists in the buffer for storing an instruction from that thread, thereby avoiding buffer overflow.

FIG. 5 is a simplified block diagram of a dispatch circuit 140 according to an embodiment of the present invention. Dispatch circuit 140 includes a scoreboard circuit 502, a scheduler 504, and an issue circuit (or issuer) 506. Scoreboard circuit 502, which may be of generally conventional design, reads each of the (valid) instructions in buffer 138 and receives signals from register file 144 indicating which registers contain valid data. Using the valid-data signals, scoreboard circuit 502 determines, for each instruction in buffer 138, whether the source operands are available in register file 144. Scoreboard circuit 502 generates a set of ready signals (e.g., one bit per thread) indicating which instructions in buffer 138 are ready to be executed, i.e., have their source operands available in register file 144. Scheduler 504 receives the ready signals from scoreboard circuit 502 and the valid signals from buffer 138 and selects a next instruction to dispatch. The selected instruction is dispatched to issuer 506, which issues the instruction by forwarding it to execution module 142. The thread identifier of the thread to which the selected instruction belongs may also be forwarded to issuer 506 and/or execution module 142, e.g., to enable selection of the appropriate registers for the source operands and result data.
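The ready-signal computation attributed to the scoreboard can be illustrated with a minimal software model. In the Python sketch below, the data layout is purely an assumption for the example: each buffered instruction is represented only by a tuple of its source register names (or `None` for an invalid thread), and each thread has its own set of registers currently holding valid data.

```python
def ready_signals(instructions, valid_registers):
    """Compute one ready bit per thread.

    instructions[i]    -- tuple of source register names for thread i's
                          buffered instruction, or None if thread i is invalid
    valid_registers[i] -- set of thread i's registers that hold valid data
    A thread is ready when it has an instruction and every source operand
    of that instruction is available.
    """
    ready = []
    for sources, regs in zip(instructions, valid_registers):
        ready.append(
            sources is not None and all(src in regs for src in sources)
        )
    return ready
```

In this model a thread whose instruction reads a register still awaiting a result is simply reported as not ready, leaving the scheduler free to pick an instruction from another thread.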

Scheduler 504 is advantageously configured to select among the ready instructions in buffer 138 with few or no constraints based on an order among threads. For example, scheduler 504 may select the ready instruction in buffer 138 that has been waiting (valid) longest, regardless of when that thread was last selected.

FIG. 6 is a simplified block diagram of a selection logic circuit 600 that may be included in scheduler 504 for selecting a thread to be dispatched from buffer 138. Selection logic circuit 600 includes a priority encoder 602 and a phase (or token) counter 604. The valid signal and the ready signal for each thread are provided as inputs to a respective AND circuit 606. Priority encoder 602 receives the output signals from AND circuits 606, i.e., a signal for each thread that is asserted when the thread's instruction in buffer 138 is valid and ready to be executed. (In some embodiments, the ready signal for a thread is not asserted when the thread is invalid, so that AND circuits 606 may be omitted.) Priority encoder 602, which may be implemented using conventional digital logic circuitry, selects the highest-priority thread for which the ready and valid signals are both asserted (i.e., the highest-priority ready thread), where the priority ranking among the threads is determined based on a control signal (CTL2) provided by phase counter 604. Phase counter 604 is a modulo N counter that increments on every clock cycle; the control signal CTL2 corresponds to the current value of phase counter 604. In this embodiment, control signal CTL2 determines the thread number of the highest-priority thread, and priority encoder 602 ranks the remaining threads in order of ascending (or descending) thread numbers, modulo N. Phase counter 604 may have the same phase as phase counter 304 of FIG. 3 (both counters may be implemented as the same counter if desired), or it may have a different phase.

Operation of priority encoder 602 is similar to that described above for priority encoder 302 of FIG. 3, and because phase counter 604 increments at each clock cycle, the priority ranking of the threads is different for different clock cycles. For example, during a first clock cycle, phase counter 604 has value 0, and priority encoder 602 gives highest priority to thread 0 (i.e., selects thread 0 if thread 0 is ready), then to thread 1 if thread 0 is not ready, and so on until a ready thread is found or a maximum number of threads is considered. During the next clock cycle, phase counter 604 has value 1, and priority encoder 602 gives highest priority to thread 1, then to thread 2 if thread 1 is not ready, and so on.

Once a thread becomes ready, it remains ready until its instruction is dispatched. Thus, while selection logic circuit 600 does not guarantee that, on any given clock cycle, the thread that has been ready longest is selected, it will be appreciated that any thread that becomes ready (and valid) will be selected within N clock cycles of becoming ready. In some embodiments, it may be desirable to prevent the same thread from being selected during consecutive clock cycles; accordingly, the maximum number of threads that priority encoder 602 considers during a clock cycle may be limited to a number smaller than the total number N of threads. (This maximum number may also be a configurable parameter of the system.)

It will be appreciated that the selection logic circuit and selection rules described herein are illustrative and that variations and modifications are possible. The various circuit components described herein may be implemented using conventional digital circuit designs and technologies. Different logic circuits may also be implemented to support different selection rules. For example, in superscalar embodiments (where more than one instruction may be issued per clock cycle), the selection logic may be configured to select multiple instructions per clock cycle. Moreover, devices other than priority encoders may be used for determining which ready thread to select. For instance, the selection logic circuit may maintain a “least recently invalid” bit field that is updated when a transition of one of the valid bits between the logical true and logical false states is detected; this bit field may be used to select the ready instruction that has been valid the longest. In still other embodiments, counters may be used to determine elapsed time since a thread became valid (or ready) and/or elapsed time since a thread was last selected; comparison logic that operates on the counter values may be provided to identify the ready thread that has been valid the longest.

In still other embodiments, other kinds of selection rules may be implemented. For instance, selection may be based in part on thread type (e.g., using selection logic similar to that shown in FIG. 4 above). Selection may also be based in part on the type of operation to be performed (e.g., giving different priorities to a MULTIPLY operation, a CALL operation, an ADD operation and so on). In addition, selection may take into account the state of the execution module. In one such embodiment, execution module 142 contains specialized execution units (or execution pipes), with different operations being directed to different execution units; e.g., there may be an execution unit that performs floating-point arithmetic and another that performs integer arithmetic. If the execution unit needed by a ready instruction for one thread is busy, an instruction from a different thread may be selected. For instance, suppose that at a given time, the floating-point pipeline is busy and the integer pipeline is free. A thread with an integer-arithmetic instruction ready can be given priority over a thread with a floating-point instruction.

Further, rules governing priority among the threads (at the fetch and/or dispatch stages) may be varied. For instance, in some embodiments, priority does not rotate among the threads on every clock cycle. In one such embodiment, priority rotates only after a clock cycle during which an instruction is issued. Highest priority may rotate, e.g., to the next thread number after the thread for which an instruction was issued or to the next sequential thread number, regardless of the thread to which the issued instruction belonged. In still another such embodiment, priority rotates only when execution of a thread ends (either normally or due to some invalidating condition). For example, the first (“oldest”) thread to begin executing may be assigned highest priority and may retain highest priority until it completes; subsequent (“younger”) threads are assigned decreasing priority according to the order in which they begin execution. In this embodiment, instructions for younger threads are issued only on cycles where the next instruction for any older thread is not ready. Some embodiments support multiple rules governing priority, with the particular rule to be applied being a configurable system parameter.

Referring again to FIG. 5, in response to the grant signal from scheduler 504, the requested instruction in buffer 138 is dispatched to issuer 506. In one embodiment, issuer 506 includes an operand collector 508 and a buffer 510. Buffer 510 receives the dispatched instruction, and operand collector 508 collects source operands for the instructions in buffer 510 from register file 144. Depending on the configuration of register file 144, collection of source operands may require multiple clock cycles, and operand collector 508 may implement various techniques for optimizing register file accesses for efficient operand collection given a particular register file configuration; examples of such techniques are known in the art.

Buffer 510 is advantageously configured to store collected operands together with their instructions while other operands for the instruction are being collected. In some embodiments, issuer 506 is configured to issue instructions to execution module 142 as soon as their operands have been collected. Issuer 506 is not required to issue instructions in the order in which they were dispatched. For example, instructions in buffer 510 may be stored in a sequence corresponding to the order in which they were dispatched, and at each clock cycle issuer 506 may select the oldest instruction that has its operands by stepping through the sequence (starting with the least-recently dispatched instruction) until an instruction that has all of its operands is found. This instruction is issued, and instructions behind it in the sequence are shifted forward; newly dispatched instructions are added at the end of the sequence. The sequence may be maintained, e.g., by an ordered set of physical storage locations in buffer 510, with instructions being shifted to different locations as preceding instructions are removed.
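The oldest-first scan of buffer 510 described above can be modeled with an ordinary list kept in dispatch order. The following Python sketch is illustrative only; the dictionary fields `name` and `operands_ready` are hypothetical stand-ins for a buffered instruction and the state of its operand collection.

```python
def issue_oldest_ready(queue):
    """Issue the least-recently dispatched entry whose operands are all
    collected. `queue` is ordered oldest first; the issued entry is removed
    so that later entries shift forward. Returns the entry, or None.
    """
    for position, entry in enumerate(queue):
        if entry["operands_ready"]:
            return queue.pop(position)
    return None


def dispatch(queue, entry):
    """Newly dispatched instructions join the end of the sequence."""
    queue.append(entry)
```

If the oldest entry is still waiting on an operand, the scan passes over it and issues the next entry that is complete, which is how issue order can differ from dispatch order in this model.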

In one embodiment, an instruction that has been dispatched to issuer 506 remains in buffer 138 until it has been issued to execution module 142. After dispatch, the instruction is advantageously maintained in a valid but not ready state (e.g., the valid bit 210 for a dispatched instruction may remain in the logical true state until the instruction is issued). It will be appreciated that in embodiments where issuer 506 may issue instructions out of the dispatch order, this configuration can help to prevent multiple instructions from the same thread from being concurrently present in buffer 510, thereby preserving order of instructions within a thread.

In other embodiments, issuer 506 does not perform operand collection. For example, issuer 506 may issue instructions to execution module 142 (or specific execution units thereof) as they are received and signal register file 144 to provide the appropriate source operands to execution module 142 (or specific execution units thereof). In this embodiment, operand collector 508 and buffer 510 may be omitted. It will be appreciated that the particular configuration of issuer 506 is not critical to understanding the present invention.

It will be appreciated that the dispatch circuit described herein is illustrative and that variations and modifications are possible. The various logic circuits described herein for the scheduler circuit may be implemented using conventional digital circuit designs and technologies. Different logic circuits may also be implemented to support different selection rules. The scheduler may also include various kinds of logic circuitry implementing additional selection rules, e.g., a minimum number of cycles before a thread can be reselected for issue, and/or different selection rules, e.g., giving priority to one thread type over another. Such rules may be implemented using logic circuitry and techniques similar to those described above in the context of thread selection for the fetch circuit.

Some alternative embodiments support superscalar (i.e., more than one per clock cycle) instruction issue, allowing issuer 506 to deliver two or more instructions per clock cycle to execution module 142. In order to keep up with the rate of issue, dispatch circuit 140 of FIG. 5 is advantageously modified to dispatch two or more ready instructions each clock cycle to buffer 510 of issuer 506. For instance, priority encoder 602 of FIG. 6 can be modified to generate a grant signal for each of the two highest-priority ready threads in buffer 138, resulting in two instructions per cycle being delivered to buffer 510 of issuer 506.
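A behavioral sketch of a priority encoder modified to grant more than one thread per cycle is given below. It is a Python illustration under stated assumptions, not the circuit itself: `phase` stands in for control signal CTL2, the per-thread AND of the valid and ready signals is computed inline, ascending thread order is assumed, and the parameter `count` (defaulting to two) is introduced for the example.

```python
def select_ready_threads(valid, ready, phase, count=2):
    """Return up to `count` thread indices, highest priority first.

    A thread is a candidate when its buffered instruction is both valid
    and ready. Priority starts at thread `phase` and proceeds in ascending
    thread order, modulo N.
    """
    n = len(valid)
    granted = []
    for offset in range(n):
        if len(granted) >= count:
            break
        thread = (phase + offset) % n
        if valid[thread] and ready[thread]:
            granted.append(thread)
    return granted
```

With four valid threads of which only thread 0 is not ready, a phase of 3 grants threads 3 and 1, while a phase of 0 grants threads 1 and 2.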

FIG. 7 is a simplified block diagram of an instruction issuer and execution module according to an embodiment of the present invention that supports superscalar instruction issue. In this embodiment, execution module 142 includes multiple functional units 701, 702, 703. Each functional unit implements a pipeline that performs a different instruction or class of instructions. For example, in one embodiment, functional unit 701 performs various integer and floating-point arithmetic operations (addition, multiplication, etc.), as well as Boolean logical operations, comparisons, format conversion, etc.; functional unit 702 performs planar interpolation as well as fast function approximation operations (e.g., for sine, cosine, logarithms, and the like); and functional unit 703 performs texture fetching and blending operations. Some functions (e.g., register-to-register moves, floating-point multiply, etc.) may be implemented in multiple ones of functional units 701-703, and an instruction may be issued to any functional unit capable of executing it.

Functional units 701-703 may also differ from each other in the length of their respective pipelines. For instance, in one embodiment, functional units 701 and 702 are each implemented as 10-cycle or 15-cycle pipelines, while functional unit 703 is implemented as a pipeline of 100 cycles or more. In addition, the functional units might also have different throughputs. For example, functional unit 701 might be able to accept a new instruction and to produce a result (after the appropriate latency) every clock while functional unit 703 can accept a new instruction and produce a result (after the appropriate latency) every tenth clock. In each case, issuer 506 issues to a particular functional unit only as frequently as that functional unit can accept a new instruction; suitable techniques for controlling rate of issue on a per-functional-unit basis are known in the art and may be used in this embodiment.

In this embodiment, up to two instructions can be dispatched in parallel from buffer 138 (see FIG. 4) and loaded into buffer 510 (see FIG. 5) of issuer 506. Operand collector 508 collects the operands as described above. Issuer 506 can issue up to two instructions for which operands have been collected to functional units 701-703 in parallel, with each functional unit receiving zero or one instruction. The instructions issued in a given clock cycle may be drawn from the various threads in any combination desired. In some embodiments, issue may be limited to one instruction per thread per clock cycle, while other embodiments may allow multiple instructions from the same thread to be issued in the same clock cycle.

It will be appreciated that the superscalar instruction issue logic described herein is illustrative and that variations and modifications are possible. For instance, the number of functional units is not limited to three, and the number of instructions issued in parallel is not limited to two. An arbitrary number (P) of instructions may be issued to an arbitrary number (X) of functional units, provided only that P≤X. Similarly, the fetch circuit may also be modified to fetch more than one instruction per clock cycle. Thus, it is to be understood that the present invention includes embodiments that fetch an arbitrary number (F) of instructions and issue an arbitrary number of instructions (P) each cycle, where the numbers F and P may be allocated among multiple threads in any manner desired. Embodiments of the invention may also be adapted for use in asynchronous processors.

Those skilled in the art with access to the present teachings will recognize that in certain embodiments of the present invention, instructions from different threads, including threads of different types, can be freely interleaved in any order. Thus, an instruction from one thread might be followed by another instruction from the same thread or an instruction from any one of the other threads. To the extent that the threads are executing different programs or executing different portions of the same program, the diversity of available work at any given time is increased. The increased diversity, in turn, increases the possibilities for exploiting the full processing capacity of the execution module and for using other instructions to hide latency associated with executing a first instruction.

FIG. 8 is a snapshot view of instructions for different threads that might be ready at the same time, illustrating a principle of diversity of work according to an embodiment of the present invention. In FIG. 8, it is supposed that a processing core executes N concurrent threads. Of these, threads 0 to K−1 are of a first type (e.g., vertex threads) while threads K to N−1 are of a second type (e.g., pixel threads). Program counter table 802 shows, in column 804, the instruction sequence (OP-1, OP-2, etc.) for a vertex shader program common to threads 0 to K−1. PC block 806 indicates, with an X in the appropriate row, the ready instruction for each vertex thread. Similarly, program counter table 812 shows, in column 814, the instruction sequence (OP-A, OP-B, etc.) for a pixel shader program common to threads K to N−1. PC block 816 indicates, with an X in the appropriate row, the ready instruction for each pixel thread. It is to be understood that any of the threads not explicitly shown in FIG. 8 may also have ready instructions at the snapshot time.

In one embodiment, functional units 701 and 702 can each accept an instruction on every clock cycle, while functional unit 703 has a lower input rate, e.g., one every 10 clock cycles. Issuer 506 issues up to two instructions for any of the N threads to functional units 701-703 of FIG. 7 on each clock cycle. Any two ready instructions can be issued as long as the two instructions are destined for different ones of functional units 701-703, and each of the destination functional units is ready to receive a new instruction. Since functional units 701 and 702 can each accept a new instruction every clock cycle, these units 701, 702 can be kept busy if one instruction for each of units 701 and 702 is ready on each clock cycle. Functional unit 703 is kept busy if it receives one instruction every 10 clock cycles.

Issuer 506 can be configured to determine, on each clock cycle, whether functional unit 703 is ready for a new instruction. If so, and if a suitable instruction is ready, issuer 506 issues that instruction and an instruction for one or the other of functional units 701 and 702. If functional unit 703 is not ready or if no instruction for functional unit 703 is ready, issuer 506 can issue one instruction each to functional units 701 and 702. It should be noted that in this embodiment, there will be a few cycles where functional unit 701 or 702 could have received a new instruction but did not. In an alternative embodiment, issuer 506 can issue up to three instructions per clock, one to each of functional units 701-703, and each functional unit receives a new instruction on every clock when the unit is ready and an instruction for that unit is ready.
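The two-instruction issue policy just described can be expressed as a small decision function. The Python sketch below is a simplified illustration: the dictionaries keyed by the reference numerals 701-703, the `pending` flags (whether some instruction with collected operands is destined for a unit), and the preference for unit 701 over unit 702 when only one slot remains are all assumptions made for the example.

```python
def choose_issue(unit_ready, pending, max_issue=2):
    """Decide which functional units receive an instruction this cycle.

    unit_ready[u] -- True if functional unit u can accept a new instruction
    pending[u]    -- True if an instruction destined for unit u is ready
    The low-throughput unit 703 is served first when possible; remaining
    issue slots go to units 701 and 702 (701 preferred, by assumption).
    """
    issued = []
    if unit_ready[703] and pending[703]:
        issued.append(703)
    for unit in (701, 702):
        if len(issued) < max_issue and unit_ready[unit] and pending[unit]:
            issued.append(unit)
    return issued
```

Setting `max_issue=3` in this model corresponds to the alternative in which each of the three units may receive an instruction on the same clock.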

To maximize the likelihood that an instruction destined for one of functional units 701-703 is ready whenever the functional unit is ready, fetch circuit 400 of FIG. 4 advantageously keeps issuer 506 supplied with a diverse set of instructions from which to choose instructions to issue. For instance, in FIG. 8, issuer 506 can choose among N instructions including OP-2 (for thread 1), OP-5 (for thread 0), OP-67 (for thread K−1), OP-A (for thread N−1), OP-C (for thread K+1), or OP-Q (for thread K), and so on for the other active threads.

In general, for a large enough value of N and typical shader programs, it can be expected that at least some of these ready instructions are destined for different functional units. For instance, in some conventional rendering programs, only pixel shaders include texture processing instructions that, in this example, would be executed by functional unit 703. If only vertex threads were available, functional unit 703 would be idle all of the time. Conversely, if only pixel threads were available, functional unit 703 would be used, but other functional units 701, 702 might go unused for some number of cycles while issuer 506 awaited the result of a texture-processing operation from functional unit 703.

In an embodiment of the present invention, issuer 820 can interleave texture instructions from pixel shaders (table 812) with instructions from vertex shaders (table 802) and thus keep the functional units 701-703 more fully used than would be the case if only one thread type were available. For instance, while functional unit 703 is executing texture instructions for one or more pixel threads, functional units 701 and 702 can be kept occupied executing other types of instructions for other types of threads. Since across-thread order does not limit the pool of available instructions, any number of instructions for any number of threads can be executed in the time it takes to execute one long-latency instruction such as a texture instruction. Thus, allowing core 800 to execute multiple thread types and to issue instructions without regard to across-thread order increases the diversity of available work and decreases idle cycles.

It should also be noted that at least some diversity of work can also exist among threads of the same type. For instance, table 802 indicates that, at a given time, different vertex threads can be at very different points in the same program. Thread 1, for instance, is near the beginning of the program, while thread K−1 is at a later point. Similarly, table 812 indicates that, at a given time, different pixel threads can be at very different points in the same program. This situation arises in part due to the ability to issue instructions from different threads out of any thread order; even threads that are started on consecutive clock cycles do not necessarily stay “in step” with each other as execution proceeds.

To the extent that different portions of a program include different mixes of instructions, divergence of the execution points (or program counters) for different threads of the same type can also increase diversity of work. For instance, referring to table 812, suppose that instructions OP-A through OP-E are arithmetic instructions destined for functional unit 701 of FIG. 7, while instruction OP-Q is a long-latency texture instruction to be executed by functional unit 703. After, or in parallel with, issuing OP-Q for thread K, issuer 820 can issue, e.g., OP-A for thread N−1, followed by OP-C for thread K+1, and so on. Thus, diverse work from threads of the same type may also be used to maximize the degree to which the processing capacity of execution module 142 is used.
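The effect described above can be illustrated with a small cycle-by-cycle model. This is only a sketch under stated assumptions: the function simulate, the latencies (five cycles for the texture instruction, two for arithmetic instructions), the instruction names OP-R, OP-B, and OP-D, in-order issue within each thread, fully pipelined functional units that accept one instruction per cycle, and the fixed scan order over threads are all hypothetical and are not taken from FIG. 8.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, int, int]  # (name, functional unit, latency in cycles)


def simulate(threads: Dict[str, List[Instr]],
             cycles: int) -> List[Tuple[int, str, str]]:
    """Issue instructions across threads without regard to thread order.

    Each thread issues its own instructions in program order and waits
    for its previous instruction to complete (an assumption of this
    sketch). Any thread whose next instruction is ready may issue,
    regardless of which thread was started or fetched first.
    Returns a log of (cycle, thread, instruction) issue events.
    """
    pc = {t: 0 for t in threads}                 # next instruction per thread
    thread_busy_until = {t: 0 for t in threads}  # cycle when last result arrives
    unit_busy_until: Dict[int, int] = {}         # cycle when unit accepts again
    log: List[Tuple[int, str, str]] = []

    for cycle in range(cycles):
        for t, program in threads.items():
            if pc[t] >= len(program) or thread_busy_until[t] > cycle:
                continue  # thread finished, or waiting on its own result
            name, unit, latency = program[pc[t]]
            if unit_busy_until.get(unit, 0) > cycle:
                continue  # destination unit already took an instruction
            unit_busy_until[unit] = cycle + 1        # pipelined: one per cycle
            thread_busy_until[t] = cycle + latency
            pc[t] += 1
            log.append((cycle, t, name))
    return log


# Hypothetical workload: thread K starts with a long-latency texture
# instruction; threads N-1 and K+1 hold arithmetic work for unit 701.
example = {
    "K":   [("OP-Q", 703, 5), ("OP-R", 701, 2)],
    "N-1": [("OP-A", 701, 2), ("OP-B", 701, 2)],
    "K+1": [("OP-C", 701, 2), ("OP-D", 701, 2)],
}
issue_log = simulate(example, cycles=8)
```

With these assumed latencies, the log shows OP-A for thread N−1 and then OP-C for thread K+1 issuing while OP-Q for thread K is still outstanding, so functional unit 701 stays occupied during the texture latency.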

It is to be understood that the number of thread types, number of threads, and other features illustrated in FIG. 8 are illustrative. The invention is not limited to two thread types; in some embodiments, an execution core might concurrently execute threads of more or fewer types. For instance, in one alternative embodiment, the execution core can concurrently execute vertex threads, pixel threads, and geometry threads that execute “geometry shader” programs. In one embodiment of a geometry shader thread, the input data is a primitive or other grouping of vertices, and the output data is a grouping of vertices that may contain more, fewer, or the same number of vertices as the original.

A thread's “type,” as used herein, can be determined by reference to the type of input data that it processes; for instance, a “vertex” thread processes vertex data while a “pixel” thread processes pixel data. In some alternative embodiments, a thread's type might be determined by reference to the program it executes; for instance, threads can be considered of the same type if they execute the same program. It should be noted that in some embodiments, one or another of the shader programs might be changed during a rendering operation, and threads executing the old shader and the new shader can coexist in the same execution core. Threads of the old and new shader programs might be considered as being of the same type (since both process the same type of input data), or they might be considered as being of different types (since they execute different programs), depending on implementation.

While the invention has been described with respect to specific embodiments, one skilled in the art will recognize that numerous modifications are possible. For instance, out-of-order instruction issue within a thread may be implemented if desired, e.g., by adapting out-of-order issue techniques from general-purpose processors that allow issue of any ready instruction within an “active window.” For instance, two or more instructions per thread could be loaded into buffer 510 of issuer 506 of FIG. 5. If, in some clock cycle, the oldest instruction for the thread with highest priority cannot be issued (e.g., because the destination functional unit is not ready or because an operand has not yet been collected), a newer instruction for that thread that can be issued might be selected instead.
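As a hedged sketch of the within-thread selection just described, the following scans a thread's active window from oldest to newest and returns the first instruction that can issue. The function pick_from_window, the tuple layout of the window, and the operands_collected set are assumptions for illustration; a real implementation would also have to confirm that a newer instruction does not depend on an older, unissued one, which this sketch assumes is already reflected in operands_collected.

```python
from typing import Dict, List, Optional, Set, Tuple


def pick_from_window(window: List[Tuple[str, int]],
                     unit_ready: Dict[int, bool],
                     operands_collected: Set[str]) -> Optional[str]:
    """Return the oldest issuable instruction in one thread's active window.

    window lists (instruction name, destination functional unit) pairs,
    oldest first. An instruction is issuable when its destination unit
    is ready and its operands have been collected.
    """
    for name, unit in window:
        if unit_ready.get(unit, False) and name in operands_collected:
            return name
    return None  # nothing in the window can issue this cycle


# Hypothetical window: the oldest instruction targets unit 703, which is
# not ready, so the newer instruction for unit 701 is selected instead.
chosen = pick_from_window(
    window=[("OP-1", 703), ("OP-2", 701)],
    unit_ready={701: True, 703: False},
    operands_collected={"OP-1", "OP-2"},
)
```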

The execution cores described herein are not limited to any particular number or configuration of execution units. For example, multiple execution units may collaborate to process a given instruction, different execution units may receive different instructions (or the same instruction with different data) in parallel, and so on. The execution units may process instructions with fixed or variable latency and may be pipelined to accept new instructions every clock cycle or, more generally, at intervals consisting of some fixed number of clock cycles.

As noted above, any number of threads and any number of thread types may be supported, with each thread type corresponding to a programmed sequence of instructions to be executed. Program instructions may be provided in various ways, including built-in instructions stored in nonvolatile memory of the graphics processor or other graphics processing subsystem components, instructions supplied by a graphics driver program at system initialization and/or runtime, and/or application-supplied program code (e.g., in the case of a programmable shader). Programs may be created in suitable high-level languages (e.g., C, Cg, or the like) and compiled using an appropriate compiler for the programming language and the graphics processor on which the program is to be executed. Translation of input instructions to a different format (or a different instruction set) that is compatible with the execution units may be provided within the execution core, within other components of the graphics processor, or elsewhere in the computer system.

In some embodiments, some or all of the threads may be executed using single-instruction, multiple-data (SIMD) techniques, thereby further increasing parallelism in an execution core without requiring additional instruction fetching or program counter logic. For instance, each functional unit in an execution core may be implemented on a SIMD pipeline capable of performing identical operations on multiple sets of input operands (e.g., 8, 16, or 32 sets) in parallel. Multiple threads of the same program can be executed as a “SIMD group.” Instructions for the SIMD group can be fetched, dispatched and issued as if the group were a single thread. Thus, if the fetch and issue logic supports N concurrent threads while the functional units are each M-way SIMD pipelines, the total number of concurrent threads in the core could be as many as N*M. Such SIMD parallelism, however, is not critical to the present invention.
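The relationship between SIMD groups and the N*M thread count can be sketched as follows. The helper names execute_simd and max_concurrent_threads, and the example values N = 24 and M = 16, are assumptions chosen only for illustration; the sketch models the M lanes with an ordinary loop rather than actual parallel hardware.

```python
from typing import Callable, List, Sequence, Tuple


def execute_simd(op: Callable[[float, float], float],
                 operand_sets: Sequence[Tuple[float, float]]) -> List[float]:
    """Apply one instruction to every operand set of a SIMD group.

    A single fetched and issued instruction (op) produces one result per
    lane; each lane corresponds to one thread of the group.
    """
    return [op(a, b) for a, b in operand_sets]


def max_concurrent_threads(n_issue_slots: int, simd_width: int) -> int:
    """Upper bound on concurrent threads: N issue-level threads, each M wide."""
    return n_issue_slots * simd_width


# One "add" instruction serves four threads of a hypothetical 4-wide group.
lane_results = execute_simd(lambda a, b: a + b,
                            [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)])

# Hypothetical sizing: N = 24 SIMD groups, M = 16 lanes each.
total_threads = max_concurrent_threads(24, 16)
```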

Graphics processors as described herein may be implemented as co-processors in a wide array of computing devices, including general purpose desktop, laptop, and/or tablet computers; various handheld devices such as personal digital assistants (PDAs), mobile phones, etc.; special-purpose computer systems such as video game consoles; and the like. In some embodiments, it is possible to leverage the execution core(s) of a graphics processor for general-purpose computing operations that might or might not be related in any way to image generation. Accordingly, although vertex, pixel, and/or geometry shaders as might be found in a rendering application are used as examples herein, it is to be understood that a thread may execute any program, not limited to shader programs or graphics-related programs.

It will also be appreciated that, although the invention has been described with reference to graphics processors, the systems and methods described herein may also be implemented in other multithreaded microprocessors.

Thus, although the invention has been described with respect to specific embodiments, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.

1.-20. (canceled)
21. A processor configured for parallel processing of a plurality of threads, the processor comprising: a priority encoder configured to assign a priority ranking to each of a plurality of different thread types of the plurality of threads; a fetch module configured to: fetch a first instruction for a first one of the plurality of threads; and fetch, after the first instruction is fetched, a plurality of additional instructions for additional ones of the plurality of threads; and a dispatch module, communicatively coupled with the priority encoder and the fetch module, and configured to: issue, during a latency period associated with the first thread, a second instruction of the plurality of additional instructions, the second instruction selected for issue based at least in part on the priority ranking associated with the second instruction's thread and an amount of time the second instruction has been ready to issue; and issue the first instruction after the second instruction is issued.
22. The processor of claim 21, wherein the second instruction's thread is of a first type assigned a priority ranking higher than priority rankings of each other of the additional plurality of threads.
23. The processor of claim 21, wherein the second instruction's thread is of a first type different from types associated with each other of the additional plurality of threads.
24. The processor of claim 23, wherein the fetch module fetches the second instruction after fetching each other of the additional plurality of threads.
25. The processor of claim 21, wherein the plurality of different thread types are defined by a type of input data each thread type processes.
26. The processor of claim 25, wherein a first thread type of the plurality of different thread types and a second thread type of the plurality of different thread types each executes a same program on different input data.
27. The processor of claim 25, wherein the plurality of different thread types comprise: a vertex thread type assigned a first priority ranking; and a pixel thread type assigned a second priority ranking.
28. The processor of claim 21, wherein the plurality of different thread types are defined by a program each thread type processes.
29. The processor of claim 21, wherein the first instruction and the second instruction are instructions from different portions of a same program.
30. The processor of claim 21, wherein the fetch module is further configured to fetch instructions from the plurality of threads according to a priority ranking for fetching thread types.
31. The processor of claim 30, wherein the priority ranking for fetching thread types comprises the priority ranking used by the dispatch module.
32. The processor of claim 21, wherein the processor comprises a graphics processor.
33. A method for executing a plurality of threads in a multithreaded processor, the method comprising: assigning a priority ranking to each of a plurality of different thread types of the plurality of threads; fetching a first instruction for a first one of the plurality of threads; fetching, after the first instruction is fetched, a plurality of additional instructions for additional ones of the plurality of threads; during a latency period associated with the first thread, issuing a second instruction of the plurality of additional instructions, the second instruction selected for issue based at least in part on the priority ranking associated with the second instruction's thread and an amount of time the second instruction has been ready to issue; and issuing the first instruction after the second instruction is issued.
34. The method of claim 33, wherein the second instruction's thread is of a first type assigned a priority ranking higher than priority rankings for types associated with each other of the additional plurality of threads.
35. The method of claim 34, wherein a fetch module fetches the second instruction after fetching each other of the additional plurality of threads.
36. The method of claim 33, wherein the plurality of different thread types are defined by a different type of input data each thread type processes.
37. The method of claim 36, wherein the plurality of different thread types comprise: a vertex thread type assigned a first priority ranking; and a pixel thread type assigned a second priority ranking indicating less importance than the first priority ranking.
38. The method of claim 33, wherein the plurality of different thread types are defined by a different program each thread type processes.
39. The method of claim 33, wherein the multithreaded processor comprises a graphics processor.
40. A device for executing a plurality of threads in a multithreaded processor, the device comprising: means for assigning a priority ranking to each of a plurality of different thread types of the plurality of threads; means for fetching a first instruction for a first one of the plurality of threads; means for fetching, after the first instruction is fetched, a plurality of additional instructions for additional ones of the plurality of threads; means for issuing, during a latency period associated with the first thread wherein the first instruction fails to become ready to issue, a second instruction of the plurality of additional instructions, the second instruction selected for issue based at least in part on the priority ranking associated with the second instruction's thread and an amount of time the second instruction has been ready to issue; and means for issuing the first instruction after the second instruction is issued.